Most problems with distributed communication derive from infrastructural issues: interrupted network connections, slow response times, or timeouts while calling another service, to name but a few. For years there have been plenty of frameworks for all common programming languages to secure these calls. Recently, however, so-called service mesh tools have appeared that provide the common resilience patterns as well. This raises the question: why implement resilience by programming inside the services if you can fix the infrastructure's problems in the infrastructure itself, i.e. in the service mesh?
The need for resilience in distributed systems
The need for resilient communication in distributed systems is certainly not new. The best-known collection of misjudgments, the list of “Fallacies of Distributed Computing”, was published around the turn of the millennium. For years, however, the topic was not dealt with very intensively in projects. One reason was certainly that mostly monolithic systems were being built, in which these problems did not have the highest priority. The trend towards developing systems for the cloud, i.e. cloud native, is bringing these infrastructure problems back into focus. In addition, it is becoming more and more common in such projects to develop a dedicated mindset for solving these problems from the outset.
Frameworks help
One way to achieve resilience is to use frameworks. The best known representatives in the field of Java are Hystrix, Resilience4J, Failsafe and MicroProfile Fault Tolerance. All these frameworks offer more or less the same thing – help with the implementation of the following resilience patterns:
- Timeout
- Retry
- Fallback
- Circuit Breaker
- Bulkhead
With Timeout, Retry, Circuit Breaker and Bulkhead, the developer basically only has to annotate the calling method to make the communication more fault tolerant. Here is an example of a Circuit Breaker implemented with MicroProfile Fault Tolerance:
@CircuitBreaker(requestVolumeThreshold = 10,
                failureRatio = 0.5,
                delay = 1000,
                successThreshold = 2)
public MyResult callServiceB() {
  ...
}
In a rolling window of 10 successive calls (requestVolumeThreshold), the Circuit Breaker is set to the open state if the error rate reaches 50% (failureRatio). Subsequent calls are then rejected for at least 1000 milliseconds (delay) before the Circuit Breaker switches to the half-open state. If two consecutive calls succeed in this state (successThreshold), the Circuit Breaker returns to the closed state.
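While the Circuit Breaker is open, the framework rejects calls without even contacting service B; with MicroProfile Fault Tolerance this shows up as a CircuitBreakerOpenException that the caller can handle. A minimal sketch – the consumer class, the injected client bean and the degraded result are hypothetical:

import javax.inject.Inject;
import org.eclipse.microprofile.faulttolerance.exceptions.CircuitBreakerOpenException;

public class ServiceBConsumer {

  @Inject
  ServiceBClient serviceBClient; // hypothetical bean containing the annotated callServiceB()

  public MyResult fetch() {
    try {
      return serviceBClient.callServiceB();
    } catch (CircuitBreakerOpenException e) {
      // The circuit is open: the call was rejected without reaching service B.
      return MyResult.unavailable(); // hypothetical degraded result
    }
  }
}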
With Fallback, on the other hand, an alternative flow in the business logic must be available in case the actual call fails. This option does not always exist and cannot be implemented without a domain-oriented alternative; Fallback can therefore be seen as a more specialized resilience pattern, whereas the four other patterns have a purely technical focus. We will come back to those technical patterns after a short fallback sketch below, and then approach the topic of resilience in distributed systems from yet another angle.
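A minimal sketch of such a fallback with MicroProfile Fault Tolerance could look like this – the fallback method and the cached domain alternative are hypothetical:

import org.eclipse.microprofile.faulttolerance.Fallback;

public class ServiceBClient {

  // If callServiceB() fails, the framework invokes the fallback method instead.
  @Fallback(fallbackMethod = "callServiceBFallback")
  public MyResult callServiceB() {
    // remote call to service B goes here
    return new MyResult();
  }

  // Must have the same signature as the guarded method – and, above all,
  // a domain-oriented alternative must actually exist (e.g. cached or default data).
  public MyResult callServiceBFallback() {
    return MyResult.fromCache(); // hypothetical domain alternative
  }
}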
Resilience in the Service Mesh
A network that consists of many distributed systems that call each other is called a service mesh. Recently, suitable tools such as Istio or Linkerd (https://linkerd.io/) have become available to manage and monitor such service meshes. These tools can now also be used on the most common cloud platforms. Both tools have one fundamental thing in common: the use of a so-called sidecar. This refers to a separate process that is installed on the target node next to the actual service. In the case of Kubernetes, a Kubernetes Pod thus consists of the service and an associated sidecar. Another important feature of the sidecar is that all communication to and from the service is routed through the sidecar process. This redirection of communication is completely transparent to the service.
It is possible to handle communication errors in the sidecar, which monitors and controls all communication. Following this approach, Istio also offers several resilience patterns which can be activated by Istio rules in the sidecar. Basically these are:
- Timeout
- Retry
- Circuit Breaker (with Bulkhead-functionality)
Compared to the list of resilience patterns in the frameworks, keen-eyed readers will notice that the Fallback and Bulkhead options seem to be missing at first glance. Since a fallback, as explained above, requires a domain-oriented alternative, it should be clear that a purely technical component like the sidecar cannot offer this functionality. But what about the bulkhead? A bulkhead limits the number of parallel calls to another service; this prevents the calling service from using too many threads for parallel calls and getting into trouble itself in the form of a full thread pool. Istio's Circuit Breaker works by limiting the number of simultaneous calls to a service. If this threshold is exceeded, the Circuit Breaker in the sidecar rejects the call and answers it with an error (HTTP status code 503 – Service Unavailable), to which the calling service should of course react appropriately. Since the call from the service is routed through the sidecar to the called service, the Circuit Breaker also acts as a bulkhead here: it limits the number of parallel calls and thus indirectly protects the calling service as well.
The definition of a Circuit Breaker with an Istio rule:
Listing 1: Circuit Breaker as an Istio DestinationRule (applied with kubectl apply -f …)
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: mycircuitbreaker
spec:
  host: serviceb
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 10
      http:
        http1MaxPendingRequests: 1
        maxRequestsPerConnection: 1
    outlierDetection:
      consecutiveErrors: 3
      interval: 10s
      baseEjectionTime: 1m
      maxEjectionPercent: 100
This rule instructs the sidecar to allow a maximum of 10 connections (maxConnections) to the service serviceb (host: serviceb). In addition, at most one pending HTTP request is accepted (http1MaxPendingRequests). The rule also defines outlier detection: a host that returns three consecutive errors (consecutiveErrors) within the 10-second scan interval (interval) is ejected from the load-balancing pool for at least one minute (baseEjectionTime), and up to 100% of the hosts may be ejected (maxEjectionPercent).
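On the calling side, such a rejection by the sidecar simply looks like an HTTP 503 response. A hedged sketch of how a caller might deal with it, using the JAX-RS client API – the endpoint URL and the reaction are assumptions for illustration:

import javax.ws.rs.client.Client;
import javax.ws.rs.client.ClientBuilder;
import javax.ws.rs.core.Response;

public class ServiceBCaller {

  private final Client client = ClientBuilder.newClient();

  public String callServiceB() {
    Response response = client
        .target("http://serviceb:8080/api/data") // hypothetical endpoint
        .request()
        .get();

    if (response.getStatus() == 503) {
      // The sidecar's circuit breaker rejected the call –
      // react gracefully instead of failing hard.
      return "service B currently unavailable";
    }
    return response.readEntity(String.class);
  }
}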
For the sake of completeness, I would like to point out that service mesh tools offer much more functionality for managing, controlling and monitoring the service mesh.
If you decide to use a service mesh tool in order to make use of additional desirable functionalities, the question naturally arises as to whether the actual implementation of resilience is to take place by means of frameworks in the service or with Istio rules in the sidecar.
Frameworks vs. Service Mesh
The first thing a developer who has to take care of resilience usually does is reach for a framework of their choice. The best-known example is Hystrix. Unfortunately, Hystrix has not been further developed since the beginning of 2019 and has been in maintenance mode ever since. Developers who already use Hystrix will therefore sooner or later have to migrate to another framework. Fortunately, there are still enough frameworks available, all of which do their job very well. Even if you are bound to one programming language (e.g. Java), you can choose between a number of frameworks.
If the decision is left to the individual development teams, you will very quickly end up with a variety of frameworks, each with different versions and different interpretations of resilience. If you look at the frameworks in detail, there are further differences in how the individual resilience patterns are implemented. Beyond that, over the course of a project, different versions of the same framework may end up in different services, and you may then notice slightly different behavior between those versions. Each development team thus faces its own framework-specific learning curve.
Migrating a service to an alternative resilience framework is arguably even more effort than changing from one service mesh tool to another: all code in every service must be re-annotated, the complete CI/CD pipeline has to run again, and a new deployment of the service is required – a step that can be omitted entirely when the resilience is implemented with sidecar rules. Compared to simply rewriting the resilience rules, this is considerably more effort.
If, for example, the decision is made in favor of Hystrix, it is tempting to use Ribbon for client-side load balancing and choose the appropriate service registry in the form of Eureka. This makes sense, because these components are well tuned to each other. However, projects in which Hystrix has not been chosen may have a different preference when discussing the appropriate service registry. Both parties will almost certainly agree on one point: the use of two service registries is not desirable.
The sidecars in the service mesh also perform client-side load balancing, since the sidecar itself takes care of the routing. The deployment information, which is read from the cloud platform and used for routing accordingly, serves as the service registry.
The reason for discontinuing Hystrix is also explained on its GitHub page: Netflix recognized that it is difficult to react promptly to runtime fluctuations with static configurations of the resilience patterns. A new approach tries to react faster with adaptive implementations. It will probably be some time, however, before this approach matures and becomes usable.
The consequence of this statement is that adjusting the resilience configuration of the individual services must be very quick and easy. Depending on how configurable you have made your services (config files, ConfigMaps, config servers), this will turn out to be more or less complex in the framework variant. Even with the most flexible approaches, a restart of the services may be necessary for the new settings to take effect.
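With MicroProfile Fault Tolerance, for example, annotation parameters can be overridden via MicroProfile Config, so the values do not have to be recompiled – whether a restart is still needed depends on the config source being used. A minimal sketch; the class, package and property location are assumptions:

import org.eclipse.microprofile.faulttolerance.CircuitBreaker;

public class ServiceBClient {

  // Compiled default: the circuit stays open for 1000 ms.
  // The value can be overridden without a code change via MicroProfile Config,
  // e.g. in META-INF/microprofile-config.properties (or a ConfigMap mapped to it):
  //   com.example.clients.ServiceBClient/callServiceB/CircuitBreaker/delay=5000
  // Key format: <fully-qualified class>/<method>/<annotation>/<parameter>
  @CircuitBreaker(delay = 1000)
  public MyResult callServiceB() {
    // remote call to service B goes here
    return new MyResult();
  }
}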
In the case of a new or modified resilience rule, the rule simply has to be applied (kubectl apply -f …) and, after a short delay, all sidecars receive the new values. The new rule then takes effect without a restart of the sidecars or of the service – the service itself does not notice any of this.
Caution! Retry Storms ahead!
It is often the case that you only determine the most suitable resilience pattern after a few (failed) attempts. Especially early on, the learning curve in this area is steep. Even experienced architects and developers find it difficult to get the approach to resilience right from the start. This is partly due to very complex communication processes, which are also subject to strong runtime fluctuations, and may be exacerbated by problems caused by the infrastructure components involved. Even a simple pattern like Retry can cause problems: repeated attempts to compensate for a failed call can flood an already struggling service with requests and degrade its availability even further. This only worsens the situation and is referred to as a Retry Storm.
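A hedged sketch of how a retry can be kept bounded with MicroProfile Fault Tolerance – the concrete values are illustrative, not a recommendation:

import java.time.temporal.ChronoUnit;
import org.eclipse.microprofile.faulttolerance.Retry;

public class ServiceBClient {

  // At most 3 retries, with a delay plus random jitter between attempts and an
  // overall budget of 10 seconds – this limits how much additional load a single
  // caller can generate when service B is already struggling.
  @Retry(maxRetries = 3,
         delay = 500, delayUnit = ChronoUnit.MILLIS,
         jitter = 200, jitterDelayUnit = ChronoUnit.MILLIS,
         maxDuration = 10, durationUnit = ChronoUnit.SECONDS)
  public MyResult callServiceB() {
    // remote call to service B goes here
    return new MyResult();
  }
}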
In such a situation, what you need is the ability to flexibly change the resilience pattern in use, not just to adjust its settings. With the framework approach, that means the developer has to touch the code again. The rule-based approach offers much more flexibility here: a new rule simply has to be applied, and if the desired effect does not materialize, the old rule can be reactivated just as quickly. Trial and error thus becomes much simpler and faster.
The purpose of a Circuit Breaker is also to protect the called service. The open and half-open states are there to give the called service a chance to recover from phases of high load. Recovery becomes much harder if requests keep arriving during this time; often such an overload leads to a crash of the service. It is therefore essential that all callers grant the service this recovery phase – a single caller that does not adhere to this rule can still bring the called service down.
With a framework, each calling service must be equipped with its own Circuit Breaker, and a single development team that fails to do so can cause runtime problems for everyone. An Istio rule, in contrast, applies to all callers of the service – a black sheep can no longer hide in the herd.
Visualization
What use are all these considerations about resilience patterns if it is difficult or impossible to determine how the individual patterns behave under heavy load or in error situations? It is for exactly this reason that Hystrix provides its own dashboard, which graphically displays the states of the resilience patterns in use. With MicroProfile Fault Tolerance, this state data can be exposed as metrics in Prometheus format. If a Prometheus server is used, preferably together with Grafana, these metrics can be displayed graphically in a Grafana dashboard.
Istio follows the same approach. The Prometheus and Grafana servers are included in the standard installation. All metrics of the entire service mesh are collected out of the box and sent to the Prometheus server. The entire service mesh can be examined using various predefined dashboards. Of course, this also applies to the metrics that the services generate on the basis of the resilience patterns.
Grafana dashboard (source: istio.io)
Intentional errors
A further advantage of Istio is the possibility of fault injection. With special rules, Istio can deliberately generate malfunctions, which are executed by the sidecar:
kubectl apply -f

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: serviceb_failure
  ...
spec:
  hosts:
  - serviceb
  http:
  - fault:
      delay:
        fixedDelay: 10s
        percent: 90
    match:
    - headers:
        end-user:
          exact: user01
    route:
    - destination:
        host: serviceb
        subset: v1
  - route:
    - destination:
        host: serviceb
        subset: v1
With this rule, a fixed delay (fixedDelay) of 10 seconds is injected into 90% (percent) of the calls to serviceb. The fault only takes effect if the HTTP header (headers) contains a field named end-user with the value user01; all other requests are forwarded normally. This way you can easily test how the service mesh behaves when response times are delayed. In addition to delays, it is also possible to simulate complete aborts of requests; in that case the sidecar interrupts the call and returns a suitable HTTP status code or a TCP connection error. Creating a comparable error simulation with the framework approach is very difficult.
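To see the rule in action from the calling side, a simple probe can set the matching header. A hedged sketch using the JAX-RS client API – the endpoint URL is an assumption:

import javax.ws.rs.client.Client;
import javax.ws.rs.client.ClientBuilder;
import javax.ws.rs.core.Response;

public class FaultInjectionProbe {

  public static void main(String[] args) {
    Client client = ClientBuilder.newClient();

    long start = System.currentTimeMillis();
    // The "end-user: user01" header matches the fault rule above, so roughly
    // 90% of these calls should be delayed by about 10 seconds.
    Response response = client
        .target("http://serviceb:8080/api/data") // hypothetical endpoint
        .request()
        .header("end-user", "user01")
        .get();
    long duration = System.currentTimeMillis() - start;

    System.out.println("Status: " + response.getStatus() + ", duration: " + duration + " ms");
    client.close();
  }
}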
Spoilt for choice…
As is often the case in life, it is not just black or white. Depending on the environment and circumstances, you have to choose a shade of grey between the two extremes. The missing fallback in Istio can only be compensated for by using a framework. For all other resilience patterns, Istio offers good alternatives to a framework, and the fault injection possibilities of Istio are not easy to replicate with frameworks.
The use of a service mesh tool certainly comes with a degree of complexity and additional effort. As a reward, however, you get not only the resilience patterns but also a whole lot of additional functionality that is essential for managing and mastering the service mesh. Every project should take a look at the toolboxes of Istio or Linkerd at least once.
Until the new adaptive Hystrix approach becomes usable, we will have to be patient and get our infrastructure under control by conventional means.
Developers or architects whose immediate reaction is to reach for the framework approach to resilience should pause for a moment and consider whether that is really the best decision. After all, infrastructure problems should be solved where they arise: in the infrastructure itself!